Abstract

In this study we develop and test a strategy for selectively sizing (multiplying by an
appropriate scalar) the approximate Hessian matrix before it is updated in the BFGS
and DFP trust-region methods for unconstrained optimization. Our numerical results
imply that for use with the DFP update the Oren-Luenberger sizing factor is completely
satisfactory and selective sizing is vastly superior to the alternatives of never sizing
or first-iteration sizing, and is slightly better than the alternative of always sizing.
Numerical experimentation showed that the Oren-Luenberger sizing factor is not a
satisfactory sizing factor for use with the BFGS update. Therefore, based on our newly
acquired understanding of the situation, we propose a damped Oren-Luenberger sizing
factor to be used with the BFGS update. Our numerical experimentation implies that
selectively sizing the BFGS update with the damped Oren-Luenberger sizing factor is
superior to the alternatives. These results contradict the folk-axiom that sizing should
be done only at the first iteration. They also show that without sufficient sizing, DFP
is vastly inferior to BFGS; however, when selectively sized, DFP is competitive with
BFGS.
1  Introduction


In  this  note  we  study  the  effect  that  sizing  (multiplying  the  approximate  Hessian  by  an 
appropriate  scalar  before  it  is  updated)  has  on  the  performance  of  the  BFGS  and  DFP 
trust-region  methods  for  unconstrained  optimization.  We  suggest  sizing  strategies  for  both 
these  updates  and  include  considerable  numerical  experimentation  that  indicates  that  our 
selective  sizing  is  superior  to  the  alternatives  of  never  sizing,  sizing  only  at  the  first  iteration, 
or  sizing  at  every  iteration.  Our  experimentation  also  supports  the  common  belief  that  sizing 
is  more  critical  when  using  the  DFP  update  than  it  is  when  using  the  BFGS  update.  In  fact 
we  were  pleasantly  surprised  to  see  that  in  our  experiments,  when  the  DFP  algorithm  was 
sized  according  to  our  strategy,  it  performed  numerically  as  well  as  the  BFGS  algorithm. 

In  the  remainder  of  this  section  we  present  some  history  and  preliminaries  concerning 
the  notion  of  sizing.  In  Section  2  we  consider  some  interesting  properties  of  the  Oren- 
Luenberger  sizing  factor.  Our  numerical  studies  showed  that  the  Oren-Luenberger  sizing 
factor  gives  satisfactory  performance  when  used  with  the  DFP  update,  but  not  with  the 
BFGS  update.  Hence,  in  Section  3,  guided  by  the  understanding  gained  in  Section  2  and  our 
numerical  experimentation,  we  motivate  and  propose  the  damped  Oren-Luenberger  sizing 
factor.  In  Section  4  we  describe  our  selective  sizing  strategy.  Our  numerical  results  are 
presented  in  Section  5,  and  in  Section  6  we  make  some  concluding  remarks.  We  make  the 
basic  assumption  that  the  reader  is  familiar  with  at  least  chapters  6  and  9  of  Dennis  and 
Schnabel  [6]. 

In 1974 Oren and Luenberger [12] proposed a class of secant methods which they referred
to as self-scaling variable metric (SSVM) methods. Soon after, Oren and Spedicato [13]
identified a subclass of the SSVM methods that had various desirable properties. Shanno and
Phua [15], among other things, studied the BFGS secant update and its associated scaling as
a member of the Oren-Spedicato subclass of SSVM methods. They argued that the BFGS
update should be scaled only at the first iteration, as opposed to every iteration as is implied
by the self-scaling philosophy. They presented numerical evidence that showed that in general
scaling only at the initial iteration was superior to scaling at every iteration. Our numerical
experiments corroborated their findings when the Oren-Luenberger sizing factor was used;
however, the opposite was true when the centered Oren-Luenberger sizing factor was used.
In 1981 Dennis, Gay and Welsch [4] proposed the now well-known NL2SOL algorithm for
the nonlinear least-squares problem. They incorporated several novel features into their
algorithm. The basic framework consists of a Gauss-Newton trust-region method. In order
to handle large residual problems they maintain a structured BFGS secant approximation
$S_k$ to the second-order part of the least-squares Hessian, and adaptively decide when to
use this approximation. One of their major concerns was that since secant methods do
not generate approximations that become arbitrarily accurate as the iteration proceeds, the
approximation to this second-order information may impede the fast convergence for zero
residual problems, i.e., the second-order information may be zero at the solution, but the
approximations $S_k$ may not converge to zero. This deficiency was overcome in the following
clever manner. First, they observed that the Oren-Luenberger scaling for the BFGS update,
say $\gamma_k$, had the property that the interval spectrum of $\gamma_k S_k$ essentially overlapped the interval
spectrum of the matrix of the exact second-order information. Hence, by scaling by $\gamma_k$ one
could guarantee that $\gamma_k S_k$ would converge to zero when the second-order information was
zero at the solution. In this application it was critical that the scaling was done at each
iteration, unlike in the Shanno-Phua applications. Very recently Huschens [8], utilizing the
Dennis-Martinez-Tapia BFGS structure principle [5], has given a most elegant solution to
this problem. In order to emphasize this spectrum shifting property, Dennis, Gay and Welsch
decided to call the process of multiplying $S_k$ by $\gamma_k$ “sizing” as opposed to scaling. We will
follow Dennis, Gay and Welsch in this usage and rigorously define what we perceive
this notion to be. We begin with the following definitions.

Definition 1.1 By the convex spectrum of a collection of $n \times n$ matrices $A_1, \ldots, A_m$, denoted
conspectrum($A_1, \ldots, A_m$), we mean the convex hull of the eigenvalues of $A_1, \ldots, A_m$.

When $A$ is symmetric the convex spectrum of $A$ is an interval, and we therefore also refer
to it as the interval spectrum of $A$.

Definition 1.2 We say that the scalar $\gamma$ sizes $B \in \mathbb{R}^{n \times n}$ relative to $A_1, \ldots, A_m \in \mathbb{R}^{n \times n}$ if

$$\mathrm{conspectrum}(\gamma B) \cap \mathrm{conspectrum}(A_1, \ldots, A_m) \neq \emptyset.$$

The  following  proposition  illustrates  the  notion  of  sizing.  It  is  easily  extended  to  the 
general  case: 

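Proposition 1.1 Let $A, B \in \mathbb{R}^{n \times n}$ be symmetric matrices. Then the scalar $\gamma$ sizes $B$ relative
to $A$ if and only if there exist $u, v \in \mathbb{R}^n$ such that

$$\gamma = \frac{u^T A u}{u^T u} \cdot \frac{v^T v}{v^T B v}. \tag{1.1}$$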

Proof.  The  proof  follows  directly  from  well-known  properties  of  the  Rayleigh  quotient.  □ 
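The overlap condition in Definition 1.2 and the Rayleigh-quotient form in Proposition 1.1 can be checked numerically for small symmetric matrices. The following is a minimal sketch, assuming NumPy; the helper names `interval_spectrum` and `sizes`, and the example matrices, are illustrative and do not come from the text.

```python
import numpy as np

def interval_spectrum(*matrices):
    """Return (lo, hi), the convex hull of the eigenvalues of symmetric matrices."""
    eigs = np.concatenate([np.linalg.eigvalsh(M) for M in matrices])
    return float(eigs.min()), float(eigs.max())

def sizes(gamma, B, *A_list):
    """True if gamma sizes B relative to A_1, ..., A_m (symmetric case)."""
    lo_b, hi_b = interval_spectrum(gamma * B)
    lo_a, hi_a = interval_spectrum(*A_list)
    # Two closed intervals intersect iff each one starts before the other ends.
    return lo_b <= hi_a and lo_a <= hi_b

# Illustrative data (assumed, not from the paper).
A = np.diag([1.0, 2.0, 3.0])
B = np.diag([10.0, 20.0, 40.0])
u = np.array([1.0, 1.0, 0.0])
v = np.array([0.0, 1.0, 1.0])

# A factor built as a Rayleigh quotient of A times an inverse Rayleigh quotient of B.
gamma = (u @ A @ u) / (u @ u) * (v @ v) / (v @ B @ v)
```

With these matrices the unsized `B` has spectrum in [10, 40], which misses the spectrum [1, 3] of `A`, while `gamma * B` overlaps it.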

Our  objective  is  to  size  the  approximate  Hessian  relative  to  the  true  Hessian  before  the 
update  is  made  in  a  secant  method.  Our  formal  definition  of  this  notion  requires  some 
motivation. 

Recall that by a secant method for approximating the minimizer of the $C^2$ function
$f : \mathbb{R}^n \to \mathbb{R}$ we mean the iterative procedure

$$x_{k+1} = x_k - B_k^{-1} \nabla f(x_k), \qquad k = 0, 1, \ldots \tag{1.2}$$

where $B_k$ is an approximation to $\nabla^2 f(x_k)$ and $B_{k+1} = U(s, y, B_k)$, the update of $B_k$, satisfies
the secant equation

$$B_{k+1} s = y \tag{1.3}$$

with $s = x_{k+1} - x_k$ and $y = \nabla f(x_{k+1}) - \nabla f(x_k)$.

Without evaluating additional functions or gradients, the only Hessian information we
have is $y$, and this is only approximate information. Specifically, $y$ is an approximation
to $\nabla^2 f(x) s$. Moreover, since $y$ is a vector and the standard mean-value theorem does not
necessarily hold for vector-valued functions, we do not know that there exists $x \in \mathbb{R}^n$ such
that $y = \nabla^2 f(x) s$. However, McLeod’s mean-value theorem [9] for vector-valued functions
says that there exist $x_1, x_2, \ldots, x_n \in \mathbb{R}^n$ such that $y$ is contained in the convex hull of
$\nabla^2 f(x_1) s, \nabla^2 f(x_2) s, \ldots, \nabla^2 f(x_n) s$. Hence, when our Hessian information comes only from
$y$, it would be shortsighted of us to define sizing in terms of the Hessian of $f$ at just one
point. We therefore propose the following definition.
Definition 1.3 Consider a $C^2$ function $f : \mathbb{R}^n \to \mathbb{R}$. We say that the scalar $\gamma$ sizes $B \in
\mathbb{R}^{n \times n}$ relative to the Hessian of $f$ if there exist $x_1, x_2, \ldots, x_m \in \mathbb{R}^n$ such that

$$\mathrm{conspectrum}(\gamma B) \cap \mathrm{conspectrum}(\nabla^2 f(x_1), \nabla^2 f(x_2), \ldots, \nabla^2 f(x_m)) \neq \emptyset. \tag{1.4}$$

We call the integer $m$ the degree of the sizing factor $\gamma$.


The Oren-Luenberger sizing factor associated with the BFGS update and used by Shanno
and Phua and Dennis, Gay, and Welsch is

$$\gamma = \frac{y^T s}{s^T B s}. \tag{1.5}$$

By the mean-value theorem we can write $\gamma = \frac{s^T \nabla^2 f(x_k + \theta s) s}{s^T B s}$ for some $0 < \theta < 1$; hence $\gamma$
sizes $B$ relative to the Hessian of $f$, and it is of degree 1.

It is worth noting that as a direct consequence of the secant equation (1.3) we have

$$\frac{y^T s}{s^T B_{k+1} s} = 1; \tag{1.6}$$

hence in any secant method all Hessian approximations, except the initial
approximation, are automatically sized relative to the Hessian of $f$. This phenomenon
seemingly adds credibility to the Shanno-Phua doctrine of sizing only at the initial iteration.
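The observation that an updated matrix is automatically sized can be verified directly. The sketch below, assuming NumPy, writes out the standard BFGS and DFP updates of the Hessian approximation (the text does not spell these formulas out, so they are supplied here as the usual textbook forms), and checks that after either update the secant equation holds and the Oren-Luenberger ratio equals 1. The test data are made up for illustration.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of a Hessian approximation B (assumes y @ s > 0)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update(B, s, y):
    """Standard DFP update of a Hessian approximation B (assumes y @ s > 0)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)

def ol_factor(B, s, y):
    """Oren-Luenberger sizing factor y^T s / s^T B s."""
    return (y @ s) / (s @ B @ s)

# Illustrative quadratic model: y = A s for a symmetric positive definite A.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 2.0]])
s = np.array([1.0, -2.0, 0.5])
y = A @ s
B0 = np.eye(3)

B_bfgs = bfgs_update(B0, s, y)
B_dfp = dfp_update(B0, s, y)
```

Before the update the factor for `B0` is `y @ s / s @ s`, which is generally not 1; after either update it is 1 up to rounding, because `B_new @ s == y`.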

Dennis and Wolkowicz [7] defer to Shanno and Phua [15] and consider some interesting
alternatives to sizing after the first step.

Carter [3] in a very interesting work presented three procedures for safeguarding Hessian
approximations. All three procedures selectively made changes to the Hessian approximations.
One of these procedures (the one he liked least) is quite similar to ours. For the BFGS
update he uses as the sizing factor for $B_k$

$$\gamma = \frac{y^T s}{s^T s} \cdot \frac{g_+^T g_+}{g_+^T B_k g_+} \tag{1.7}$$

where $s = x_{k+1} - x_k$, $y = \nabla f(x_{k+1}) - \nabla f(x_k)$ and $g_+ = \nabla f(x_{k+1})$. Moreover, he
sizes whenever $\gamma$ in (1.7) is less than 1. He stated that this choice of sizing for the BFGS
method performed relatively poorly on large dimensional problems. We experienced a similar
phenomenon with the Oren-Luenberger sizing factor (1.5); more will be said about this issue
in Section 3. Observe that Carter’s choice (1.7) sizes $B_k$ relative to the Hessian of $f$ and it
is of degree 1.
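The closing observation about Carter's factor can be made explicit with one line of algebra. The following worked step is a sketch under the assumption that $f$ is $C^2$ on the segment between $x_k$ and $x_{k+1}$, so that the scalar mean-value theorem applies to $t \mapsto \nabla f(x_k + t s)^T s$; the intermediate point $\theta$ is introduced here for illustration.

```latex
\gamma
  = \frac{y^T s}{s^T s}\cdot\frac{g_+^T g_+}{g_+^T B_k g_+}
  = \underbrace{\frac{s^T \nabla^2 f(x_k + \theta s)\, s}{s^T s}}_{\text{Rayleigh quotient of } \nabla^2 f(x_k+\theta s)}
    \cdot
    \underbrace{\frac{g_+^T g_+}{g_+^T B_k g_+}}_{\text{inverse Rayleigh quotient of } B_k},
  \qquad 0 < \theta < 1 .
```

This has the form required by Proposition 1.1 with $A = \nabla^2 f(x_k + \theta s)$, $u = s$ and $v = g_+$, and only one Hessian evaluation point appears, which is why the degree is 1.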

Al-Baali  [1]  also  considered  selective  scaling  for  the  BFGS  update  using  various  scaling 
strategies. 


2  Oren-Luenberger  Sizing 

The following proposition and its corollary played a major motivational role in our original
plan to selectively size both the DFP and the BFGS update with the Oren-Luenberger sizing
factor (1.5). In the proposition we assume the standard assumptions for secant method
theory, see Dennis and Schnabel [6] for details, and we consider a secant method of the form
(1.2)-(1.3) which generates symmetric and positive definite updates, e.g. the DFP or the
BFGS secant method.

Proposition 2.1 If the secant method (1.2)-(1.3) with steplength 1 converges q-superlinearly,
then the Oren-Luenberger sizing factor $\gamma_k = \frac{y_k^T s_k}{s_k^T B_k s_k}$ converges to 1.

Proof: Suppose that $\{x_k\}$ converges to the minimizer $x_*$. From the standard assumptions
we know that $\nabla^2 f(x_*)$ is positive definite. Let $\lambda_* > 0$ be the smallest eigenvalue of $\nabla^2 f(x_*)$.
In what follows the quantities $s$, $y$, and $B$ should all be viewed with a subscript $k$. We have

$$\left| \frac{y^T s}{s^T s} - \frac{s^T B s}{s^T s} \right| \le \frac{\|y - Bs\|}{\|s\|} \le \frac{\|\nabla^2 f(x_*) s - Bs\|}{\|s\|} + O(\|s\|). \tag{2.1}$$

Also,

$$\left| \frac{s^T \nabla^2 f(x_*) s}{s^T s} - \frac{s^T B s}{s^T s} \right| \le \frac{\|\nabla^2 f(x_*) s - Bs\|}{\|s\|}. \tag{2.2}$$

The Dennis-Moré characterization of q-superlinear convergence (Theorem 8.2.4 of Dennis
and Schnabel [6]) implies that the right-hand sides of (2.1) and (2.2) converge to zero. From
(2.2) and properties of the Rayleigh quotient we have that $\liminf \frac{s^T B s}{s^T s} \ge \lambda_*$. Let us write

$$\frac{y^T s}{s^T B s} - 1 = \frac{s^T s}{s^T B s} \left[ \frac{y^T s}{s^T s} - \frac{s^T B s}{s^T s} \right]. \tag{2.3}$$

The proof now follows from (2.1) and (2.3) since we have established that $\frac{s^T s}{s^T B s}$ is uniformly
bounded in $k$. □


Corollary 2.1 Proposition 2.1 holds for the BFGS secant method without the assumption
that the convergence is q-superlinear.

Proof: Very recently Byrd, Tapia and Zhang [2] have demonstrated that under the standard
assumptions if the BFGS secant method with steplength 1 converges, then the convergence
is q-superlinear. □

Armed with this encouragement we initially set out to selectively size both the BFGS
and DFP secant trust-region methods using the Oren-Luenberger sizing factor. From the
very beginning our success for DFP was immediate and significant. Even in the case when
we sized at every iteration, the sizing factor seemed to be converging to 1. Initially we
also experienced good success for the BFGS trust-region method. Only when we tried large
dimensional problems, e.g. problems of dimension greater than 20, did we conclude that
the Oren-Luenberger sizing and the BFGS secant update were a bad fit. For these larger
problems sizing often hurt the performance, yet it often helped; we could
not determine when it would help or when it would hurt. It is interesting to note that Carter
[3] mentioned that he also experienced poor performance for the larger dimensional problems
when he selectively applied his sizing factor (1.7) to the BFGS secant method. We noticed
that in several examples the Oren-Luenberger sizing factor did not converge to 1. However,
in these examples the trust region was active in some of the final iterations; hence Corollary
2.1 did not seem to be violated.

In  a  very  interesting  recent  work  Nocedal  and  Yuan  [11]  have  studied  the  asymptotic 
behavior  of  the  BFGS  secant  method  using  Oren-Luenberger  sizing.  Their  theory  implies 
that  the  sizing  factor  need  not  converge  to  1  even  if  steplength  1  is  taken.  Hence,  according 
to  Proposition  2.1,  Oren-Luenberger  sizing  may  impede  superlinear  convergence.  Whether 
such  a  theory  exists  for  the  DFP  secant  update  is  perhaps  an  interesting  open  question. 


3  Centered  Oren-Luenberger  Sizing 


We  have  argued  that  Oren-Luenberger  sizing  (1.5)  and  the  BFGS  update  are  not  an  effective 
combination.  We  also  feel  that  the  same  is  probably  true  for  Carter  (1.7)  sizing  and  the  BFGS 
update. Since in trust-region methods one does not have access to $B^{-1}$ we do not have the
option of using so-called “inverse sizing”, i.e., multiplying $B$ by $\gamma = \frac{y^T B^{-1} y}{y^T s}$. It is of some
interest  to  note  that  according  to  our  formal  notion  of  sizing  (Definition  1.3),  “inverse  sizing” 
is  not  a  sizing  of  B. 

At  this  juncture  it  is  our  considered  opinion  that  for  the  BFGS  trust-region  method 
there  is  no  truly  effective  sizing  factor  of  degree  1.  Essentially  we  believe  that  a  sizing 
factor  of  degree  1  does  not  carry  enough  information  to  consistently  improve  the  overlap  of 
the  respective  spectra  for  large  dimensional  problems.  Therefore,  in  the  following  we  will 
attempt  to  construct  an  effective  sizing  factor  of  degree  greater  than  1  for  the  BFGS  update. 

Let us recall that the objective of sizing is to overlap the respective spectra. An ideal
overlap would match the center of the interval spectrum of $B$ with the center of the interval
spectrum of $\nabla^2 f(x)$. Of course the centers of these interval spectra are not known.
However, given various points in the interval spectrum, a convex combination, in particular the
arithmetic mean (average), of these points serves as an approximation to the center of the
interval spectrum. This is the central idea behind the sizing factor we are about to construct.

In what follows we use the notation $s = x_{k+1} - x_k$, $s_- = x_k - x_{k-1}$, $y = \nabla f(x_{k+1}) - \nabla f(x_k)$,
$y_- = \nabla f(x_k) - \nabla f(x_{k-1})$. By the centered Oren-Luenberger sizing factor we mean

$$\gamma(\theta_k) = \frac{(1 - \theta_k)(y_-^T s_-)/(s_-^T s_-) + \theta_k (y^T s)/(s^T s)}{(1 - \theta_k)(s_-^T B_k s_-)/(s_-^T s_-) + \theta_k (s^T B_k s)/(s^T s)} \tag{3.1}$$

where $0 \le \theta_k \le 1$. Clearly for $\theta_k = 1$ we obtain the Oren-Luenberger sizing factor, and for
$\theta_k = 0$ we obtain the constant 1. To see this recall the secant equation $B_k s_- = y_-$. Our
intuition tells us that we should be particularly interested in $\gamma(\theta_k)$ for $\theta_k = 1/2$.
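The centered factor is simple to compute from the current and previous steps. Here is a minimal sketch assuming NumPy; the function name `centered_ol_factor`, its argument order, and the small test data are illustrative choices, not part of the paper. The data are built so that the previous secant equation holds, which is the property the text relies on.

```python
import numpy as np

def centered_ol_factor(theta, B, s, y, s_prev, y_prev):
    """Centered Oren-Luenberger factor: a ratio of averaged Rayleigh quotients.

    theta = 1 reproduces y^T s / s^T B s; theta = 0 gives 1 whenever
    B @ s_prev == y_prev (the previous secant equation).
    """
    num = ((1.0 - theta) * (y_prev @ s_prev) / (s_prev @ s_prev)
           + theta * (y @ s) / (s @ s))
    den = ((1.0 - theta) * (s_prev @ B @ s_prev) / (s_prev @ s_prev)
           + theta * (s @ B @ s) / (s @ s))
    return num / den

# Illustrative data: B satisfies the previous secant equation B s_prev = y_prev.
B = np.diag([2.0, 5.0])
s_prev = np.array([1.0, 1.0])
y_prev = B @ s_prev
s = np.array([1.0, 0.0])
y = np.array([1.0, 0.5])

gamma_ol = centered_ol_factor(1.0, B, s, y, s_prev, y_prev)    # 0.5
gamma_half = centered_ol_factor(0.5, B, s, y, s_prev, y_prev)  # 2.25 / 2.75
gamma_zero = centered_ol_factor(0.0, B, s, y, s_prev, y_prev)  # 1.0
```

In this example the plain factor is 0.5 while the centered factor with `theta = 0.5` is about 0.82, so the centered choice shrinks `B` less aggressively.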

The following proposition shows that $\gamma(\theta_k)$ is also a “softening” or “dampening” of the
Oren-Luenberger factor $\gamma_k$. Recall that in standard implementations of the DFP and BFGS
secant updates we don’t update if $y^T s \le 0$, and we are guaranteed that $y^T s > 0$ near the
solution. These safeguards guarantee that $B_k$ will be positive definite. Hence, no generality is
lost by assuming that $y^T s$, $y_-^T s_-$, and $s^T B s$ are all positive for the purpose of the following
proposition.


Proposition 3.1 The following statements are equivalent:
(i) $\gamma_k < 1$;
(ii) $\gamma(\theta_k) < 1$;
(iii) $\gamma_k < \gamma(\theta_k)$.

Proof: The proof is straightforward once we recall that $B_k s_- = y_-$. □
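The argument the proof leaves to the reader can be written out in a few lines. The shorthand $a$, $b$, $c$ below is introduced only for this sketch, and the strict range $0 < \theta_k < 1$ is an assumption made here so that both equivalences are non-trivial (at the endpoints the centered factor reduces to 1 or to the Oren-Luenberger factor itself).

```latex
a = \frac{y_-^T s_-}{s_-^T s_-} = \frac{s_-^T B_k s_-}{s_-^T s_-} > 0, \qquad
b = \frac{y^T s}{s^T s} > 0, \qquad
c = \frac{s^T B_k s}{s^T s} > 0,
\]
\[
\gamma_k = \frac{b}{c}, \qquad
\gamma(\theta_k) = \frac{(1-\theta_k)\,a + \theta_k\, b}{(1-\theta_k)\,a + \theta_k\, c},
\]
\[
\gamma(\theta_k) < 1 \iff \theta_k b < \theta_k c \iff b < c \iff \gamma_k < 1,
\]
\[
\gamma(\theta_k) - \gamma_k
  = \frac{(1-\theta_k)\, a\, (c - b)}{c\,\bigl((1-\theta_k)\,a + \theta_k\, c\bigr)} > 0
  \iff b < c .
```

The first equality for $a$ is exactly where the previous secant equation $B_k s_- = y_-$ is used.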

Proposition 3.2 The scalar $\gamma(\theta_k)$ sizes $B_k$ relative to the Hessian of $f$ and is of degree 2.

Proof: The proof is straightforward. □

It should be clear that a centered Oren-Luenberger sizing factor of degree $m$ could
be defined analogously to (3.1) by employing the quantities $s_k, s_{k-1}, \ldots, s_{k-m+1}$ and $y_k,
y_{k-1}, \ldots, y_{k-m+1}$. Since the centered Oren-Luenberger sizing factor of degree 2 gave
satisfactory numerical results we did not experiment with those of larger degree.

4  Selective  Sizing  Strategy 

Our strategies, for both the BFGS and DFP updates, are to size at the first iteration using the
Oren-Luenberger sizing factor (1.5), and then selectively size at other iterations using the
Oren-Luenberger sizing factor for the DFP update and the centered Oren-Luenberger sizing
factor for the BFGS update. Recall that $\gamma(\theta_k)$ denotes the centered Oren-Luenberger
sizing factor (3.1) and $\gamma_k$ denotes the Oren-Luenberger sizing factor (1.5). Also $\epsilon_1$ and $\epsilon_2$ are
small positive constants and $\tau_1$ and $\tau_2$ are nonnegative constants.

General Sizing Strategy
For $k = 0$ size $B_0$ by $\max(\epsilon_2, \gamma_0)$.

For $k = 1, 2, \ldots$ let $\theta_k = \min(\tau_1, \tau_2 \|s_k\|)$.

If $\gamma(\theta_k) < 1 - \epsilon_1$, then size $B_k$ by $\max(\epsilon_2, \gamma(\theta_k))$.

DFP Sizing Strategy
Choose $\tau_1 = 1$ and $\tau_2$ large.

BFGS Sizing Strategy
Choose $\tau_1 = 1/2$ and $\tau_2$ large.
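The strategy above can be sketched as a single function that is called just before the secant update. This is a minimal illustration assuming NumPy, not the code used in the experiments: the function names, the signature, and the convention of returning a factor of 1.0 when no sizing is applied are all assumptions made here. The default constants mirror the BFGS choices reported in Section 5.

```python
import numpy as np

def ol_factor(B, s, y):
    """Oren-Luenberger factor y^T s / s^T B s."""
    return (y @ s) / (s @ B @ s)

def centered_factor(theta, B, s, y, s_prev, y_prev):
    """Centered Oren-Luenberger factor for 0 <= theta <= 1."""
    num = ((1.0 - theta) * (y_prev @ s_prev) / (s_prev @ s_prev)
           + theta * (y @ s) / (s @ s))
    den = ((1.0 - theta) * (s_prev @ B @ s_prev) / (s_prev @ s_prev)
           + theta * (s @ B @ s) / (s @ s))
    return num / den

def size_before_update(k, B, s, y, s_prev=None, y_prev=None,
                       tau1=0.5, tau2=1.0e6, eps1=0.05, eps2=0.1):
    """Return (sized B, factor used); the factor is 1.0 if B is left alone.

    tau1 = 1 gives the DFP variant, tau1 = 1/2 the BFGS variant.
    """
    if k == 0:
        gamma = max(eps2, ol_factor(B, s, y))      # always size B_0
        return gamma * B, gamma
    theta = min(tau1, tau2 * np.linalg.norm(s))
    gamma = centered_factor(theta, B, s, y, s_prev, y_prev)
    if gamma < 1.0 - eps1:                         # size only when B is "too large"
        gamma = max(eps2, gamma)                   # guard against near-singular B
        return gamma * B, gamma
    return B, 1.0

# Illustrative call at iteration k = 1 (data are made up).
B = np.diag([2.0, 5.0])
s_prev = np.array([1.0, 1.0])
y_prev = B @ s_prev
s = np.array([1.0, 0.0])
B_sized, gamma_used = size_before_update(1, B, s, np.array([1.0, 0.5]), s_prev, y_prev)
B_kept, gamma_kept = size_before_update(1, B, s, np.array([3.0, 0.0]), s_prev, y_prev)
B_first, gamma_first = size_before_update(0, np.eye(2), s, np.array([0.01, 0.0]))
```

In the first call the centered factor is about 0.82, so `B` is scaled; in the second it exceeds 1 and `B` is returned unchanged; in the third the floor `eps2` is what gets applied.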

The features $\epsilon_1$, $\epsilon_2$ and $\tau_2$ serve as safeguards for our sizing strategy. The
use of $\epsilon_2$ prevents us from inadvertently creating a nearly singular matrix by sizing with
an excessively small constant. On rare occasions this feature was selected in our numerical
experiments. The use of $\epsilon_1 > 0$ and $\tau_2 > 0$ ensures that sizing will eventually be shut off.
To see this observe that with $\tau_2 > 0$ we have that $\theta_k \to 0$ (since $\|s_k\| \to 0$ as the iterates
converge), so $\gamma(\theta_k) \to 1$, i.e., we have forced the sizing factor to converge to 1, rather than
depending on theory that is not yet fully understood. However, in our numerical experiments
described in the following section we did not use this feature and always sized with
$\theta_k = 1$ for DFP and $\theta_k = 1/2$ for BFGS.

The choice $\tau_1 = 1$ is extremely important for the DFP method. It has been our experience
that DFP loves to be sized by the Oren-Luenberger sizing factor. On the other hand the
choice $\tau_1 = 1/2$ seems to be extremely important for the BFGS method. It seems as if sizing
the BFGS method is a delicate issue especially for large dimensional problems, and using 2
pieces of approximate Hessian information is significantly better than using just 1.
5  Numerical  Results


The sizing procedures of Section 4 were adapted to the secant trust-region algorithm in the
Dennis-Schnabel code [6]. This code uses the locally constrained optimal step, or “hook
step”, of Hebden (1973) and Moré [10] to obtain an approximate solution to the trust-region
subproblem. The code allows only the BFGS secant update, so we added the option of using
the DFP secant update.

Six test functions were selected from the standard test set of Moré, Garbow and Hillstrom [10],
together with Oren's Power Function. In the following table we use the notation $f(d)$. Here $f$ is
an integer running from 1 to 7 and $d$ denotes the number of variables or function arguments.
We use $f = 1$ to signify the Helical Valley Function, $f = 2$ signifies the Penalty Function,
$f = 3$ signifies the Extended Powell Singular Function, $f = 4$ signifies Oren’s Power Function,
$f = 5$ signifies the Extended Rosenbrock Function, $f = 6$ signifies the Trigonometric
Function, and finally $f = 7$ signifies the Wood Function. All of the above functions can
be found in Moré, Garbow, and Hillstrom [10], except for Oren's Power Function, and this
latter function is

$$f(x) = (x^T A x)^2$$

where $A = [a_{ij}]$, with $a_{ij} = i \delta_{ij}$ and $\delta_{ij}$ denotes the Kronecker delta.
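For readers who want to reproduce the test problem, Oren's Power Function and its gradient are short to code. The sketch below assumes NumPy and reads $A$ as the diagonal matrix with entries $1, 2, \ldots, n$, so that $x^T A x = \sum_i i\, x_i^2$; the function names are illustrative. The gradient $4\,(x^T A x)\,A x$ follows from the chain rule and is not stated in the text.

```python
import numpy as np

def oren_power(x):
    """f(x) = (x^T A x)^2 with A = diag(1, 2, ..., n)."""
    i = np.arange(1, len(x) + 1)
    q = np.sum(i * x**2)          # quadratic form x^T A x
    return q**2

def oren_power_grad(x):
    """Gradient 4 (x^T A x) A x of Oren's Power Function."""
    i = np.arange(1, len(x) + 1)
    q = np.sum(i * x**2)
    return 4.0 * q * i * x

x = np.array([1.0, 1.0])
fx = oren_power(x)        # (1*1 + 2*1)^2 = 9
gx = oren_power_grad(x)   # 4 * 3 * [1, 2] = [12, 24]
```

The Hessian of this function vanishes at the minimizer $x = 0$, which is part of what makes it a demanding test for secant methods.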

To test the effectiveness of our sizing strategy, we modified the Dennis-Schnabel optimization
code to accept an arbitrary initial matrix $B_0$; in its unmodified form it always takes
$B_0 = I$ and sizes only at the first iteration. We chose several very ill-conditioned initial
matrices as well as $B_0 = I$. The notation $D(a)$ denotes the diagonal matrix with all diagonal
elements set equal to $a$. The notation $D(a, b)$ means the diagonal matrix which alternates
$a$ and $b$ along its diagonal. Initial matrix BTZ, $q = n$ means the Byrd-Tapia-Zhang matrix,
Bq  =  diagonal (aa, . . .  ,an)  +  /,  where  a,-  =  (1  -  (^y))(10?  -  1).  The  starting  point  is  denoted 
by $x_0$. We have chosen several nonstandard starting points, as well as the standard starting
points that can also be found in Moré, Garbow, and Hillstrom. In all our experiments we
allowed a maximum of 300 iterations. The letter F means failure, i.e., lack of convergence
in 300 iterations. The stopping criterion used is the one used by the Dennis-Schnabel
optimization code, i.e., we define convergence when the relative gradient is less than or equal to
$1.0 \times 10^{-5}$. The machine used to obtain these numerical results was a Sun 3/160.

In Table 2, the letters “n.s.” mean we never size $B_k$ before updating; “a.s.” means we
always size $B_k$ when updating; “f.i.s.” means we size only on the first iteration. “S.S(n.l.s)”
means two things: the S.S. part denotes the number of iterations it took the algorithm to
converge when we sized $B_k$ as described in Section 4, and the (n.l.s) part gives how many
times we sized during those iterations. The symbol F' means the
algorithm stopped very close to the answer after 300 iterations. Finally $(x_1, x_2, x_3, x_4, \ldots)^*$
means repeat this set of numbers until the dimension of the problem is reached.

We found that when updating $B_k$ using the BFGS update effective choices for $\epsilon_1$ and
$\epsilon_2$ were $\epsilon_1 = .05$ and $\epsilon_2 = .1$, and for the DFP update effective choices were $\epsilon_1 = .001$
and $\epsilon_2 = .1$. Our results were not particularly sensitive to these choices. Our choice for
$\tau_2$ was $1.0 \times 10^6$; hence our shut-off feature was never activated and we selectively sized
until convergence. Table 1 is self-explanatory and summarizes our numerical results for the
various approaches applied to our 27 problem configurations.


Update   Never Sizing   Always Sizing   1st Iteration Sizing   Selective Sizing
BFGS     12/27          20/27           15/27                  27/27
DFP       2/27          24/27            0/27                  26/27

TABLE 1: Summary of Success Ratios


6  Summary  and  Concluding  Remarks 

In this note we studied the effect that sizing the approximate Hessian has on the BFGS and
DFP trust-region methods. We accomplished this objective by adapting the Dennis-Schnabel
BFGS trust-region code [6] to accept a user-supplied initial Hessian approximation (they
always use the identity) and to allow the user the option of the DFP secant update as well as
the BFGS. We then ran the code on several standard test problems from the optimization
literature using various starting points and various initial Hessian approximations. In order
to seriously challenge the algorithms studied we considered many nonstandard starting
points which were quite far from the solution (as well as the standard starting points) and
several extremely ill-conditioned initial Hessian approximations. While our initial Hessian
approximations were all diagonal matrices, some had condition numbers as large as 10 .
This numerical study led us to the following opinions.

The  DFP  update  desperately  needs  to  be  sized  and  the  Oren-Luenberger  sizing  factor 
(1.5)  is  an  excellent  sizing  factor  for  this  update.  In  our  study  sizing  at  each  iteration  was 
only  slightly  inferior  to  our  selective  sizing  strategy,  and  at  this  point  we  are  not  prepared 
to  recommend  one  over  the  other.  Without  sizing,  the  DFP  update  is  vastly  inferior  to  the 
BFGS  update.  However,  when  properly  sized  it  is  superior  to  the  unsized  BFGS  update  and 
competitive  with  the  selectively  sized  BFGS  update.  Our  numerical  experience  has  led  us 
to  the  intuition  that  when  sizing  is  working,  the  sizing  factor  is  converging  to  1.  Moreover, 
always  sizing  seemed  to  demonstrate  this  behavior  somewhat  more  consistently  than  our 
selective  sizing  did  for  the  DFP  update. 

We  learned  that  sizing  the  BFGS  update  is  a  delicate  issue  for  large  dimensional  problems. 
Our  experience  led  us  to  the  belief  that  sizing  the  BFGS  update  with  a  factor  based  on 
approximate  Hessian  information  only  at  one  point  would  not  be  completely  effective.  For 
this  reason  we  developed  what  we  call  the  centered  Oren-Luenberger  sizing  factor  to  be  used 
with  the  BFGS  update.  It  uses  approximate  Hessian  information  gained  at  the  current  point 
and  at  the  previous  point  to  approximate  the  centers  of  the  respective  interval  spectra,  and 
both  intuitively  and  experimentally  seems  to  do  a  better  job  of  overlapping  the  spectrum 
of  the  approximate  Hessian  with  the  spectrum  of  the  Hessian.  While  more  research  and 
experimentation  is  needed  in  this  direction,  we  believe  that  we  have  presented  a  convincing 
case  that  it  is  a  viable  alternative  to  traditional  sizing.  Indeed  with  this  new  form  of  sizing 
we  obtained  a  success  rate  of  100%  for  the  problems  we  tried. 

For  both  the  BFGS  and  the  DFP  trust-region  methods  we  found  that,  in  general,  we 
should  size  when  our  sizing  factor  was  less  than  1,  and  not  size  when  it  was  greater  than  1. 
These  findings  are  clearly  reflected  in  our  sizing  strategy  described  in  Section  4,  and  we  offer 
the  following  plausible  explanation. 

When the sizing factor is greater than 1, $B_k$ is “small” relative to the Hessian in some
(as yet undefined) sense; hence $s_k = -B_k^{-1} \nabla f(x_k)$ will be “large”. However, the trust
region strategy can compensate for a “large” $s_k$; it merely reduces the trust-region radius
and re-solves the subproblem. On the other hand when our sizing factor is less than 1, $B_k$
is “large” in this undefined sense; hence $s_k$ is “small”. The trust-region strategy accepts $s_k$,
since it is “small” and contained in the trust region. This could continue for several iterations,
producing very slow progress towards convergence, since the steps would be excessively
“small”. In this case sizing would help by making $B_{k+1}$ “smaller” and the resulting $s_{k+1}$
“larger”.

We believe that it is appropriate to end this section with a discussion which demonstrates
the value, the beauty, and the subtlety of sizing. Let us consider the application of the
gradient method to the strictly convex quadratic function $f(x)$ with Hessian $A$. The gradient
method with steplength $\alpha_k$ is the iterative procedure

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k). \tag{6.1}$$

By rewriting (6.1) as

$$x_{k+1} = x_k - \left( \frac{1}{\alpha_k} I \right)^{-1} \nabla f(x_k), \tag{6.2}$$

we can view the gradient method with steplength $\alpha_k$ as a quasi-Newton method which
employs steplength 1, but sizes the approximate Hessian (the identity matrix) with the sizing
factor $\gamma_k = \alpha_k^{-1}$. The standard gradient method chooses the steplength by a 1-dimensional
minimization and in this case would give $\gamma_k = g_k^T A g_k / g_k^T g_k$ where $g_k = \nabla f(x_k)$. This is
clearly a sizing factor for $I$ and has some of the flavor of the Carter sizing factor (1.7).
Recently, in a very elegant and interesting work Raydan [14] demonstrated that if instead
one chooses the steplength by choosing the sizing factor $\gamma_{k+1} = s_k^T A s_k / s_k^T s_k$, where $s_k = x_{k+1} - x_k$,
then superior convergence behavior is obtained and superlinear convergence results for a
subclass of problems. Notice that this choice of steplength is equivalent to applying
Oren-Luenberger sizing. Raydan’s theory demonstrates that while two Rayleigh quotients may
both give points in the interval spectrum of $A$, one may have hidden properties which make
it particularly effective for the application in question.
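The two choices of sizing factor for the identity can be compared on a small quadratic. The following is a minimal sketch assuming NumPy and the model $f(x) = \tfrac{1}{2} x^T A x$ (so that $\nabla f(x) = A x$); the function name, the rule labels `"cauchy"` and `"bb"`, the test matrix and the tolerances are all illustrative assumptions. The `"bb"` rule uses the Rayleigh quotient of the previous step, with a one-dimensional minimization for the very first step only.

```python
import numpy as np

def gradient_method(A, x0, rule, tol=1e-10, max_iter=500):
    """Gradient method x+ = x - g / gamma on f(x) = 0.5 x^T A x.

    rule = "cauchy": gamma = g^T A g / g^T g (exact line search).
    rule = "bb":     gamma = s^T A s / s^T s using the previous step s.
    Returns the final iterate and the list of sizing factors used.
    """
    x = np.asarray(x0, dtype=float)
    g = A @ x
    s = None
    gammas = []
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if rule == "cauchy" or s is None:
            gamma = (g @ A @ g) / (g @ g)
        else:
            gamma = (s @ A @ s) / (s @ s)
        s = -g / gamma            # step taken with steplength 1 / gamma
        x = x + s
        g = A @ x
        gammas.append(gamma)
    return x, gammas

A = np.diag([1.0, 10.0])
x_c, gam_c = gradient_method(A, [1.0, 1.0], "cauchy")
x_b, gam_b = gradient_method(A, [1.0, 1.0], "bb")
```

Both rules produce factors inside the interval spectrum [1, 10] of `A`, so both size the identity relative to the Hessian; they differ only in which vector the Rayleigh quotient is taken at.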